OpenCores
Analysis of the profiling and software algorithms
by julius on Oct 20, 2009
julius
Posts: 363
Joined: Jul 1, 2008
Last seen: May 17, 2021
Hi all,

I've done a little bit of analysis of the profiling results, and based on this we can consider where to replace software algorithms with calls to a hardware accelerator.

First: I generated the profiling statistics from all of the currently checked-in *_gmon.out files. I am not sure how appropriate this is; I have assumed they were all generated using somewhat similar, and appropriate, configurations.

The following profiling output was generated using this command:

gprof ../bin/x264 `ls *gmon.out`

from the trunk/doc/x264_profiling/gmon_files path of the checked out repository. If you do the same you should see the same results as I'll discuss (and post parts of) below.

Second: if we look at the list of the overall biggest time-consumers among the functions, we see it's mostly the sum-of-absolute-differences functions; when it isn't those, it's two motion compensation functions, mc_chroma() and get_ref(), and of course the transform functions are computationally expensive too. This is not at all surprising.

Each sample counts as 0.01 seconds.
  %    cumulative    self                     self      total
 time    seconds    seconds          calls   ms/call   ms/call  name
14.96     238.06     238.06  2,315,692,330      0.00      0.00  x264_pixel_satd_8x4
13.03     445.44     207.38    741,563,641      0.00      0.00  mc_chroma
11.19     623.47     178.03    855,124,060      0.00      0.00  get_ref
 5.60     712.54      89.07     54,097,960      0.00      0.00  x264_pixel_sad_x4_16x16
 4.78     788.55      76.01    569,692,937      0.00      0.00  quant_4x4
 4.77     864.48      75.93    170,000,269      0.00      0.00  x264_pixel_sad_x4_8x8
 3.31     917.15      52.67    905,172,666      0.00      0.00  x264_pixel_satd_4x4
 3.00     964.91      47.76     36,612,716      0.00      0.00  x264_pixel_sad_x3_16x16
 2.99    1012.51      47.60    114,023,979      0.00      0.00  x264_pixel_sad_16x16
 2.68    1055.19      42.68    135,568,775      0.00      0.00  sub8x8_dct
 2.53    1095.43      40.24    115,102,227      0.00      0.00  x264_pixel_sad_x3_8x8

(I added commas to the call counts for easier reading.)

Looking at the code of these functions shows they are not particularly complex, nor do they do anything particularly lengthy. I hope I am not misinterpreting this, but we can see each call takes less than 0.01 ms (the per-call columns round down to 0.00), confirming this.
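For reference, here is roughly what these two kernel families do, sketched in plain C with simplified signatures. This is a sketch, not x264's actual code: the real versions are macro-generated for many block sizes and take separate strides for each operand.

```c
#include <stdlib.h>

/* Rough sketch of the kernels dominating the profile. SAD is a plain sum of
 * absolute pixel differences; SATD applies a 4x4 Hadamard transform to the
 * differences before summing, which better approximates coding cost. */
static int sad_wxh(const unsigned char *cur, const unsigned char *ref,
                   int stride, int w, int h)
{
    int sum = 0;
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++)
            sum += abs(cur[y * stride + x] - ref[y * stride + x]);
    return sum;
}

static int satd_4x4(const unsigned char *a, const unsigned char *b, int stride)
{
    int d[4][4];
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            d[i][j] = a[i * stride + j] - b[i * stride + j];
    /* 4-point Hadamard transform, horizontally then vertically */
    for (int i = 0; i < 4; i++) {
        int s0 = d[i][0] + d[i][1], s1 = d[i][0] - d[i][1];
        int s2 = d[i][2] + d[i][3], s3 = d[i][2] - d[i][3];
        d[i][0] = s0 + s2; d[i][2] = s0 - s2;
        d[i][1] = s1 + s3; d[i][3] = s1 - s3;
    }
    for (int j = 0; j < 4; j++) {
        int s0 = d[0][j] + d[1][j], s1 = d[0][j] - d[1][j];
        int s2 = d[2][j] + d[3][j], s3 = d[2][j] - d[3][j];
        d[0][j] = s0 + s2; d[2][j] = s0 - s2;
        d[1][j] = s1 + s3; d[3][j] = s1 - s3;
    }
    int sum = 0;
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            sum += abs(d[i][j]);
    return sum >> 1;
}
```

Each invocation touches only a few hundred bytes, which fits with the tiny per-call times above.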

There is no way this is the level at which we should be calling an accelerator block, i.e. setting up an accelerator to do 16 multiplications and additions/subtractions or whatever is required. While that might yield some performance increase, it would be more useful to look at which functions are calling these functions frequently, and see if we can convert functions from a few levels up into one large automated algorithm which we can assign to an accelerator block.

Breaking down the flow of this software to find where the majority of the work occurs will tell us at which level it would be most appropriate to create calls to the accelerator blocks.
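To make the idea concrete, here is a purely illustrative software model of a coarser-grained accelerator boundary: rather than one call per SAD, hand over the current block plus a whole list of candidate motion-vector offsets, and get back only the index of the best match. All names here are invented for illustration, not part of x264.

```c
#include <limits.h>
#include <stdlib.h>

struct mv { int x, y; };

/* Plain 16x16 sum of absolute differences over a strided frame buffer. */
static int sad_16x16(const unsigned char *cur, const unsigned char *ref,
                     int stride)
{
    int sum = 0;
    for (int y = 0; y < 16; y++)
        for (int x = 0; x < 16; x++)
            sum += abs(cur[y * stride + x] - ref[y * stride + x]);
    return sum;
}

/* Software model of one "search-level" accelerator call: in hardware, the
 * loop over candidates (and the SADs inside it) would run inside the block,
 * replacing millions of tiny per-block software calls with one invocation. */
static int best_candidate(const unsigned char *cur,
                          const unsigned char *ref, int stride,
                          const struct mv *cand, int n)
{
    int best = 0, best_cost = INT_MAX;
    for (int i = 0; i < n; i++) {
        int cost = sad_16x16(cur, ref + cand[i].y * stride + cand[i].x,
                             stride);
        if (cost < best_cost) { best_cost = cost; best = i; }
    }
    return best;
}
```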

The following is from the call graph generated from the profiling output.

From the top: the function Encode_frame() does the encoding of each frame.

Inside Encode_frame(), x264_encoder_encode() is the primary work function.

Inside x264_encoder_encode() two functions do most of the work: x264_slice_write() and x264_slicetype_decide().

Inside x264_slice_write() the function x264_macroblock_analyse() does the lion's share, while x264_macroblock_encode() and x264_fdec_filter_row() do equal, but lesser (about 1/20th), amounts.

The following is the profiling output for the first few functions of x264_macroblock_analyse():

index  % time    self  children        called           name
[5]      81.8   11.04   1290.36        6688170          x264_macroblock_analyse [5]
                18.95    703.48   58251665/82486152     x264_me_search_ref [6]
                 3.49    274.40    6648822/6648822      x264_mb_analyse_inter_p16x16 [8]
                 0.96    171.29    5496544/5496544      x264_mb_analyse_p_rd [14]
                 8.80     61.14    5535892/5535892      x264_mb_analyse_intra [24]
                 0.22     30.23    5535892/5535892      x264_intra_rd [35]
This function takes up 81.8% of the overall run time.

Inside x264_macroblock_analyse() the work appears to even out a little more: x264_me_search_ref() takes about three times as much time as x264_mb_analyse_inter_p16x16(), which in turn takes about 1.5 times as much as x264_mb_analyse_p_rd(), which takes about three times as much as x264_mb_analyse_intra(), which takes about twice as much as x264_intra_rd().

x264_me_search_ref() takes up about 65% of the overall time. One particular function, refine_subpel(), is called on every invocation and by itself accounts for about 60% of x264_me_search_ref()'s time. This is the profiling output for x264_me_search_ref():

index  % time    self  children        called           name
[6]      64.3   26.84    996.15       82486152          x264_me_search_ref [6]
                25.80    619.10   82486152/82486152     refine_subpel [7]
                57.82      0.00   35117060/54097960     x264_pixel_sad_x4_16x16 [21]
                49.88      0.00  111674586/170000269    x264_pixel_sad_x4_8x8 [23]
                47.76      0.00   36612716/36612716     x264_pixel_sad_x3_16x16 [28]
                47.60      0.00  114023979/114023979    x264_pixel_sad_16x16 [29]
                45.44      0.00  218278118/855124060    get_ref [13]
                40.24      0.00  115102227/115102227    x264_pixel_sad_x3_8x8 [31]
                25.87      0.00  225932729/248076297    x264_pixel_sad_8x8 [36]
                 8.47      0.00    9133756/13895171     x264_pixel_sad_x4_8x16 [48]
                 7.24      0.00    9046902/13764017     x264_pixel_sad_x4_16x8 [53]
                 6.44      0.00    9345419/9345419      x264_pixel_sad_x3_8x16 [69]
                 5.89      0.00    9258926/9258926      x264_pixel_sad_x3_16x8 [72]
                 4.36      0.00   21861433/21861433     x264_pixel_sad_8x16 [78]
                 4.24      0.00   21432281/21432281     x264_pixel_sad_16x8 [80]

The self column shows the time (in seconds) spent in each of these functions on behalf of calls from this parent. We can see that all of these x264_pixel_sad_*() functions get called a lot, but individually each call contributes very little time. The children column is 0, I assume, because these are leaf functions that call nothing further themselves.

Let's consider this for x264_pixel_sad_x4_16x16(). Overall, this function is the 4th most time-consuming, with a total of 89.07 seconds spent in it. It gets called 35,117,060 times from within x264_me_search_ref() out of the 54,097,960 times it gets called throughout the entire program. This means that about 65% (35117060/54097960 ≈ 0.649) of the time spent in this function comes from calls made within x264_me_search_ref(), which, as the self column shows, is 57.82 seconds (64.9% of 89.07 seconds).

This is just one example of the many simpler functions that get called here with great frequency. Clearly, here is an opportunity to somehow combine the calls to these simpler algorithms into something we could assign to an accelerator block to do instead.
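In fact, x264's _x4 kernels already hint at this direction in software: a single pass over the current block scores four candidate reference positions at once, so the current-block pixels are only loaded once. A simplified sketch (x264's real signature differs):

```c
#include <stdlib.h>

/* Simplified sketch of the idea behind the x264_pixel_sad_x4_* kernels:
 * compute four SADs against four reference candidates in one pass over the
 * current 8x8 block, amortizing the current-block loads. */
static void sad_x4_8x8(const unsigned char *cur, int cur_stride,
                       const unsigned char *r0, const unsigned char *r1,
                       const unsigned char *r2, const unsigned char *r3,
                       int ref_stride, int scores[4])
{
    scores[0] = scores[1] = scores[2] = scores[3] = 0;
    for (int y = 0; y < 8; y++) {
        for (int x = 0; x < 8; x++) {
            int c = cur[x];
            scores[0] += abs(c - r0[x]);
            scores[1] += abs(c - r1[x]);
            scores[2] += abs(c - r2[x]);
            scores[3] += abs(c - r3[x]);
        }
        cur += cur_stride;
        r0 += ref_stride; r1 += ref_stride;
        r2 += ref_stride; r3 += ref_stride;
    }
}
```

An accelerator block could push the same amortization much further, scoring a whole search pattern in a single invocation.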

A similar place to insert calls to an accelerator appears to be in the next function: refine_subpel():
index  % time    self  children        called           name
[7]      40.5   25.80    619.10       82486152          refine_subpel [7]
               183.18      0.00  655023448/741563641    mc_chroma [12]
               132.59      0.00  636845942/855124060    get_ref [13]
                 5.41     95.35  463771960/573054240    x264_pixel_satd_8x8 [15]
                 3.05     76.93   93545695/114745985    x264_pixel_satd_16x16 [20]
                31.25      0.00   18980900/54097960     x264_pixel_sad_x4_16x16 [21]
                26.05      0.00   58325683/170000269    x264_pixel_sad_x4_8x8 [23]
                25.07      0.00  430884793/905172666    x264_pixel_satd_4x4 [27]
Here we again see large numbers of calls to the x264_pixel_sad_*() functions.

The mc_chroma() function is the second most time-consuming function overall in our profiling, and get_ref(), which we also saw in x264_me_search_ref(), takes up the third largest amount of time in this program. These two functions have no children and are purely computational, doing (I think!) inter-pixel interpolation and colour interpolation for motion compensation (processing the generated motion vectors to generate the decoded frame buffer).
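For reference, the core of chroma motion compensation is, as far as I understand it, a bilinear filter over the four neighbouring chroma samples, with weights derived from the 1/8-pel fractional motion vector components. A minimal sketch of one output sample, assuming fractional offsets dx, dy in [0,8):

```c
/* Minimal sketch of the mc_chroma-style bilinear filter: each output sample
 * is a weighted average of the four neighbouring chroma samples. dx, dy are
 * the 1/8-pel fractional offsets; the weights sum to 64, hence the +32
 * rounding and >>6 normalization. */
static unsigned char chroma_bilinear(const unsigned char *src, int stride,
                                     int dx, int dy)
{
    int a = (8 - dx) * (8 - dy);   /* weight of top-left sample     */
    int b = dx * (8 - dy);         /* weight of top-right sample    */
    int c = (8 - dx) * dy;         /* weight of bottom-left sample  */
    int d = dx * dy;               /* weight of bottom-right sample */
    return (unsigned char)((a * src[0] + b * src[1] +
                            c * src[stride] + d * src[stride + 1] + 32) >> 6);
}
```

The real function additionally loops this over a whole block and handles both chroma planes, but the inner arithmetic is this simple, which makes it another obvious candidate for a hardware datapath.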

So, hopefully we can somehow break out calls to an accelerator block from functions like x264_me_search_ref() and refine_subpel() to achieve some large increases in performance.

I think there needs to be a lot more investigation of the algorithms contained in functions like these, as well as any other more macroscopic level of processing in this software.

Before any specification of what possible accelerator blocks might do, we need to consider how we can take large chunks of the algorithms contained inside these functions and perform them in hardware.

I will spend time doing this in the immediate future and if others have ideas of where and how we should perform substitution of software for hardware acceleration, and what that hardware accelerator should contain, please share your ideas.

I think this is the point at which important decisions must be made about the form of the accelerator block, and where it will fit into the software.

There are also other considerations here, like how general the resulting block will really be. Unfortunately, unless we do similar analyses of other software H.264 encoders we can't really tell, but this is the way the original spec suggested it be done:

"When we have a working SW implementation on the OpenRISC platform as well as the profiling results, it is time for optimization. Certain critical algorithms will be supported by HW accelerators, and the original SW for such an algorithm will be replaced by a device driver for the accelerator. "
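That "device driver" substitution could, very roughly, take a shape like the following. Everything here is invented for illustration: the register layout, the simulated register file standing in for the memory-mapped interface, and the contiguous (stride-free) 16x16 block. A real block on ORPSoC would define its own interface (and would likely signal completion with an interrupt rather than a polled flag).

```c
#include <stdint.h>

/* Invented register layout for a hypothetical SAD accelerator. */
enum { ACC_REG_CUR, ACC_REG_REF, ACC_REG_CTRL, ACC_REG_DONE,
       ACC_REG_RES, ACC_NREGS };

/* Simulated frame memory and register file, standing in for the real
 * MMIO window so the driver flow can be exercised in plain software. */
uint8_t acc_frame_mem[1 << 16];
static uint32_t acc_regs[ACC_NREGS];

/* Simulated MMIO write: writing 1 to CTRL "runs" the hardware, which here
 * computes the SAD of two contiguous 256-byte blocks immediately. */
static void acc_write(int reg, uint32_t v)
{
    acc_regs[reg] = v;
    if (reg == ACC_REG_CTRL && v == 1) {
        const uint8_t *cur = acc_frame_mem + acc_regs[ACC_REG_CUR];
        const uint8_t *ref = acc_frame_mem + acc_regs[ACC_REG_REF];
        uint32_t sum = 0;
        for (int i = 0; i < 256; i++)
            sum += cur[i] > ref[i] ? cur[i] - ref[i] : ref[i] - cur[i];
        acc_regs[ACC_REG_RES]  = sum;
        acc_regs[ACC_REG_DONE] = 1;
    }
}

static uint32_t acc_read(int reg) { return acc_regs[reg]; }

/* The driver-style replacement for a software sad_16x16(): set up operand
 * addresses, clear the done flag, kick off the block, poll, read result. */
uint32_t hw_sad_16x16(uint32_t cur_off, uint32_t ref_off)
{
    acc_write(ACC_REG_CUR, cur_off);
    acc_write(ACC_REG_REF, ref_off);
    acc_write(ACC_REG_DONE, 0);
    acc_write(ACC_REG_CTRL, 1);
    while (!acc_read(ACC_REG_DONE))
        ;                           /* busy-wait; an IRQ would be nicer */
    return acc_read(ACC_REG_RES);
}
```

The per-call register traffic here is exactly why the granularity discussion above matters: a driver like this only pays off if each invocation replaces a substantial amount of software work.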

I am also currently very close to having the x264 application running stand-alone on ORPSoC, i.e. not on top of Linux (you might have seen I got newlib working; this is what I'm using to compile a slightly modified x264 application). Hopefully I will soon have a patch for the version of the x264 software we are working from which allows it to run on ORPSoC completely, and from there we can perhaps make a branch of ORPSoC or similar allowing us to simulate any hardware accelerator blocks we are working on and analyse the results. I will post when this work is ready, so please save any discussion on that topic for that thread.

I'm most interested to hear people's thoughts on this approach I have mentioned here.

Cheers,
Julius